

      PATTERN-MATCHING RULES FOR THE RECOGNITION OF
         NATURAL LANGUAGE DIALOGUE EXPRESSIONS

		Kenneth Mark Colby
		       and
		Roger C. Parkison


	            INTRODUCTION

	To recognize something is to identify it as an instance of
the "same again".  This familiarity is possible because of recurrent
characteristics of the world.  We shall describe an algorithm which
recognizes recurrent characteristics of natural language dialogue
expressions.  It utilizes a multi-stage sequence of pattern-matching
rules for progressively transforming an input expression until it
eventually matches an abstract stored pattern.  The stored pattern
has a pointer to a response function in memory which decides what to
do once the input has been recognized.  Here we discuss only the
recognizing functions, except for one response function (anaphoric
substitution) which interactively aids the recognition process.
Details of how the response functions operate will be described in a
future communication by William Faught and ourselves.
	We are constructing and testing a simulation of paranoid
thought processes; our problem is to reproduce paranoid linguistic
behavior in a teletyped diagnostic psychiatric interview.  The
diagnosis of paranoid states, reactions, or modes is made by
clinicians who judge the degree of correspondence between what they
observe in an interview and their conceptual model of paranoid
behavior.  There exists a high degree of agreement among
psychiatrists about this conceptual model, which relies mainly on
what an interviewee says and how he says it.
	Natural language is a life-expressing code which people use
for communication with themselves and others.  In a real-life
dialogue such as a psychiatric interview, the participants have
interests, intentions, and expectations which are revealed in their
linguistic expressions.  An interactive simulation of a paranoid
patient must be able to demonstrate typical paranoid linguistic
behavior.  To achieve this effect, our paranoid model must have the
ability to deal with the teletyped messages of an interviewer.
	A number of approaches have been taken for dealing with
natural language dialogue expressions (Winograd, 1972; Woods, 1970).
These approaches rely on parsers which conduct a detailed syntactic
and semantic analysis.  They perform well for the purposes for which
they were designed.  Their weakness, for our purposes, lies in their
lack of mechanisms for neglecting and ignoring.  Such mechanisms are
necessary in a program which accepts and responds to unrestricted
conversational English characterized by expressions novel to the
program.
	How humans process natural language is largely unknown.  They
possess some knowledge of grammatical rules, but this fact does not
entail that they use a grammar in interpreting and producing
language.  It seems implausible to us that people possess full
transformational grammars for processing language.  Language is what
is recognized, but the processes involved may not be linguistic or
grammatical.  Transformational grammars were not originally designed
to "understand" a large subset of English; they constituted a formal
method for deciding whether a string is grammatical.
	An analysis of what one's problem actually is should guide
the selection or invention of methods appropriate to its solution.
Our problem is not to develop a consistent and general theory of
language, nor to assert empirically testable hypotheses about how
people process language.  Our problem is to design an algorithm
which recognizes what is being said in a dialogue, and what is being
said about it, in order to make a response such that a sample of I-O
pairs from the paranoid model is judged similar to a sample of I-O
pairs from paranoid patients.  The design task belongs to artificial
intelligence, in which the criterion is how adequately the computer
program performs mind-like functions.  New methods had to be devised
for an algorithm to participate in a human dialogue in a
paranoid-patient-like way.  We sought effective methods which could
operate efficiently in real time.  Since our method provides a
general many-to-one mapping from surface expressions to a single
stored pattern, it is not limited to the simulation of paranoia, but
can be used by any type of "host" system which takes natural
language as input.
	Our method is to transform the input until a pattern is
obtained which matches, completely or partially, a more abstract
stored pattern.  This strategy has proved adequate for our purposes
a satisfactory percentage of the time.  The power of this method for
natural language dialogues lies in its ability to ignore as
irrelevant some of what it recognizes and everything it does not
recognize at all.  A linguistic parser doing word-by-word,
parts-of-speech analysis fails when it cannot find one or more of
the input words in its dictionary.  A system that must know every
word is too fragile for unrestricted dialogues.
	In early versions of the paranoid model, such as PARRY1, some
of the pattern recognition mechanisms allowed the elements of the
pattern to be order-independent (Colby, Weber, and Hilf, 1971).  For
example, consider the following expressions:
	(1) WHERE DO YOU WORK?
	(2) WHAT SORT OF WORK DO YOU DO?
	(3) WHAT IS YOUR OCCUPATION?
	(4) WHAT DO YOU DO FOR A LIVING?
	(5) WHERE ARE YOU EMPLOYED?
In PARRY1 a procedure scans these expressions looking for an
information-bearing contentive such as "work", "for a living", etc.
When it finds such a contentive along with "you" or "your" in the
expression, regardless of word order, it responds to the expression
as if it were a question about the nature of one's work.  This
method correctly classifies the five sentences above.
Unfortunately, it includes the two examples below in the same
category:
	(6) DOES YOUR FATHER'S CAR WORK?
	(7) HOW DID THINGS WORK OUT FOR YOU?
An insensitivity to word order has the advantage that lexical items
representing different parts of speech can represent the same
concept, e.g. the word "work" represents the same concept whether it
is used as a noun or a verb.  But a price is paid for this
resilience and elasticity.  We find from experience that, since
English relies heavily on word order to convey the meaning of its
messages, the average penalty of misunderstanding (to be
distinguished from ununderstanding) is too great.  Hence in PARRY2,
as will be described shortly, all the patterns require a specified
word order.
	For high-complexity problems it is helpful to have
constraints.  Diagnostic psychiatric interviews (and especially
those conducted over teletypes) have several natural constraints.
First, clinicians are trained to ask certain questions in certain
ways.  This limits the number of patterns required to recognize
utterances about each topic.  Second, only a few hundred standard
topics are brought up by interviewers, who are, furthermore, trained
to use everyday expressions and especially those used by the patient
himself.  When the interview is conducted by teletype, expressions
tend to be shortened, since the interviewer tries to increase the
information transmission rate over the slow channel of a teletype.
Finally, teletyped interviews represent written utterances, and
utterances are known to be highly redundant, such that unrecognized
words can be ignored without losing the meaning of the message.
Also, utterances are loaded with idioms, cliches, pat phrases,
etc. - all easy prey for a pattern-matching approach.  It is
time-wasting and usually futile to try to decode an idiom by
analyzing the meanings of its individual words.
	We now describe the pattern-matching functions of the
algorithm in some detail.  (See Fig. 1 for a diagram of the overall
flow of control.)


		    OVERVIEW

	PARRY2 has two primary modules.  The first attempts to
RECOGNIZE the input and the second RESPONDS.  This paper is
primarily about the RECOGNIZE module.  It functions independently of
the RESPOND module except in the case of pronoun references, which
the RESPOND module provides to the RECOGNIZER on request.
	The recognition module has four main steps:
	1) Identify the words in the question and convert them to
		internal synonyms.
	2) Break the input into segments at certain bracketing words.
	3) Match each segment (independently) to a stored pattern.
	4) Match the resulting list of recognized segments to a stored
		compound pattern.
Each of these steps, except the segmenting, throws away what it
cannot identify.  Occasionally a reference to an unknown topic is
mis-recognized as some familiar topic.

		    PREPROCESSING

	Each word in the input expression is first looked up in a
dictionary of (currently) about 1900 entries which, for the sake of
speed, is maintained in core during run-time.  (The dictionary is
given in Appendix 2.)  The dictionary, which was built empirically
from thousands of teletyped interviews with previous versions of the
model, consists of words, groups of words, and names of word-classes
they can be translated into.  The size of the dictionary is
determined by the patterns, i.e. a word is not included unless it
appears in one or more patterns.  Entries in the dictionary reflect
PARRY2's main interests.  If a word in the input is not in the
dictionary, it is checked to see if it ends with one of the common
suffixes given in Fig. 2.  If it does, the suffix is removed and the
remaining word is looked up again.  If it is still not in the
dictionary, it is dropped from the pattern being formed.  Thus if
the input is:
	WHAT IS YOUR CURRENT OCCUPATION?
and the word "current" is not in the dictionary, the pattern at this
stage becomes:
	( WHAT IS YOUR OCCUPATION )
The question-mark is thrown away as redundant since questions are
recognized by word order.  (A statement followed by a question mark
(YOU GAMBLE?) is responded to in the same way as that statement
followed by a period.)  Synonymic translations of words are made so
that the pattern becomes, for example:
	( WHAT BE YOU JOB )
	Some groups of words (i.e. idioms) are translated as a group
so that, for example, "for a living" becomes "for job".  Certain
other juxtaposed words are contracted into a single word, e.g.
"place of birth" becomes "birthplace".  This is done to deal with
groups of words which are represented as a single element in the
stored pattern, thereby preventing segmentation from occurring at
the wrong places, such as at a preposition inside an idiom or
phrase.  Besides these contractions, certain expansions are made so
that, for example, "DON'T" becomes "DO NOT" and "I'D" becomes "I
WOULD".
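	The following sketch (in Python, for illustration; the actual
implementation is in MLISP) shows the flavor of this preprocessing
step.  The toy dictionary, suffix list, and expansions are
hypothetical stand-ins for the roughly 1900 entries of Appendix 2
and the suffixes of Fig. 2; idiom and group translation is omitted
for brevity.

    # Illustrative preprocessing: expand contractions, translate
    # known words to internal synonyms, strip common suffixes,
    # and drop anything still unrecognized.
    DICTIONARY = {                      # word -> internal synonym (toy)
        "what": "WHAT", "is": "BE", "are": "BE", "your": "YOU",
        "you": "YOU", "occupation": "JOB", "work": "JOB",
    }
    SUFFIXES = ["ing", "ed", "s"]       # illustrative; see Fig. 2
    EXPANSIONS = {"don't": ("do", "not"), "i'd": ("i", "would")}

    def preprocess(text):
        words = text.rstrip("?.!").lower().split()
        expanded = []                   # "don't" -> "do not", etc.
        for w in words:
            expanded.extend(EXPANSIONS.get(w, (w,)))
        pattern = []
        for w in expanded:
            entry = DICTIONARY.get(w)
            if entry is None:           # retry with a suffix removed
                for suf in SUFFIXES:
                    if w.endswith(suf) and DICTIONARY.get(w[:-len(suf)]):
                        entry = DICTIONARY[w[:-len(suf)]]
                        break
            if entry is not None:
                pattern.append(entry)   # unknown words are dropped
        return pattern

    print(preprocess("WHAT IS YOUR CURRENT OCCUPATION?"))
    # -> ['WHAT', 'BE', 'YOU', 'JOB']
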
	Misspellings can be the bane of teletyped interviews for an
algorithm.  Here they are handled in two ways.  First, common
misspellings of important words are simply put in the dictionary.
Thus "yuu" is known to mean "you".  The apostrophe is often omitted
from contractions, so most contractions are recognized with or
without it.  These common misspellings were gathered from over 4000
interviews with earlier versions of the paranoid model.
	Second, five common forms of typing error are checked
systematically.  These are:
	1) Doubled letter
	2) Extraneous letter
	3) Forgetting to hold the "shift key" for an apostrophe
	4) Hitting a nearby key on the keyboard
	5) Transposing two letters in a word
The first three errors can be corrected by deleting the offending
character from the word.  This is accomplished by deleting each
character in turn until the word is recognized.  The fourth type of
error is checked for only eight of the more common near misses.
These were also empirically determined and involve the letter pairs
(T Y), (Q W), (Y U), (I O), (G H), (O P), (A S), and (N M).  These
methods are all based on typing errors, but they also correct some
legitimate English spelling errors.  Two-letter transposition
corrects, for example, "beleive" to "believe".
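	A sketch of these five checks follows (again Python for
illustration, with a caller-supplied "known" predicate standing in
for dictionary lookup); it generates candidate corrections rather
than reproducing the original MLISP control flow.

    # Illustrative correction of the five typing errors.  The first
    # three (doubled letter, extraneous letter, unshifted apostrophe)
    # are all undone by deleting one character.
    NEAR_MISSES = [("t", "y"), ("q", "w"), ("y", "u"), ("i", "o"),
                   ("g", "h"), ("o", "p"), ("a", "s"), ("n", "m")]

    def corrections(word, known):
        """Yield known words reachable by undoing one typing error."""
        for i in range(len(word)):      # errors 1-3: delete a character
            c = word[:i] + word[i + 1:]
            if known(c):
                yield c
        pairs = NEAR_MISSES + [(b, a) for a, b in NEAR_MISSES]
        for i, ch in enumerate(word):   # error 4: nearby key was hit
            for a, b in pairs:
                if ch == a and known(word[:i] + b + word[i + 1:]):
                    yield word[:i] + b + word[i + 1:]
        for i in range(len(word) - 1):  # error 5: transposed letters
            c = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            if known(c):
                yield c

    print(list(corrections("beleive", {"believe"}.__contains__)))
    # -> ['believe']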

                      SEGMENTING

	Another weakness in the crude pattern matching of PARRY1 is
that it takes the entire input expression as its basic processing
unit.  If only two words are recognized in an eight-word utterance,
the risk of misunderstanding is great.  We need a way of dealing
with units shorter than the entire input expression.
	Aided by a heuristic from work in machine translation (Wilks,
1973), we devised a way of bracketing the pattern constructed up to
this point into shorter segments, using prepositions, wh-forms,
certain verbs, etc. as bracketing points.  (A list of the bracketing
terms appears in Fig. 3.)  These points tend to separate
prepositional phrases and embedded clauses from the main clause.
The new pattern formed is termed either "simple", having no
delimiters within it, or "compound", i.e., being made up of two or
more simple patterns.  A simple pattern might be:
	( WHAT BE YOU JOB )
whereas a compound pattern would be:
	(( WHY BE YOU ) ( IN HOSPITAL ))
Our experience with this method of segmentation shows that compound
patterns from teletyped psychiatric dialogues rarely consist of more
than three or four segments.
	After certain verbs (See Fig. 4) a bracketing occurs to
replace the commonly omitted "THAT", such that:
	( I THINK YOU BE AFRAID )
becomes
	(( I THINK ) ( YOU BE AFRAID ))
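	In outline, the segmenter can be sketched as follows (Python
for illustration); the two delimiter sets shown are small
hypothetical samples of the actual lists in Figs. 3 and 4.

    # Illustrative segmentation: split before prepositions and
    # wh-forms, and after the verbs that commonly omit "THAT".
    SPLIT_BEFORE = {"WHAT", "WHY", "HOW", "IN", "ON", "AT", "TO", "FOR"}
    SPLIT_AFTER = {"THINK", "BELIEVE", "KNOW"}

    def segment(pattern):
        segments, current = [], []
        for word in pattern:
            if word in SPLIT_BEFORE and current:
                segments.append(current)    # close the open segment
                current = []
            current.append(word)
            if word in SPLIT_AFTER:
                segments.append(current)    # omitted "THAT" follows
                current = []
        if current:
            segments.append(current)
        return segments

    print(segment(["WHY", "BE", "YOU", "IN", "HOSPITAL"]))
    # -> [['WHY', 'BE', 'YOU'], ['IN', 'HOSPITAL']]
    print(segment(["I", "THINK", "YOU", "BE", "AFRAID"]))
    # -> [['I', 'THINK'], ['YOU', 'BE', 'AFRAID']]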

		   MATCHING INDIVIDUAL SEGMENTS

	Conjunctions serve only as markers for the segmenter and they
are dropped out after segmentation.
	Negations are handled by extracting the "NOT" from the
segment and assigning a value to a global variable which indicates
that the expression is negative in form.  When a pattern is finally
matched, this variable is consulted.  Some patterns have a pointer
to a pattern of opposite meaning if a "NOT" could reverse their
meanings.  If this pointer is present and a "NOT" was found, then
the pattern matched is replaced by its opposite, e.g. ( I NOT TRUST
YOU ) is replaced by the pattern ( I MISTRUST YOU ).  We have not
yet observed the troublesome case of "he gave me not one but two
messages".  (There is no need to scratch where it doesn't itch.)
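	A minimal sketch of this negation handling follows; the
opposite-meaning table and its entries are illustrative, not the
model's actual data.

    # "NOT" is stripped from the segment and remembered in a flag;
    # if the matched pattern has an opposite, the flag selects it.
    OPPOSITES = {("I", "TRUST", "YOU"): ("I", "MISTRUST", "YOU")}

    def extract_not(segment):
        negative = "NOT" in segment
        return [w for w in segment if w != "NOT"], negative

    def resolve(matched, negative):
        return OPPOSITES.get(matched, matched) if negative else matched

    seg, neg = extract_not(["I", "NOT", "TRUST", "YOU"])
    # (fuzzy matching would reduce seg to the stored pattern)
    print(resolve(("I", "TRUST", "YOU"), neg))
    # -> ('I', 'MISTRUST', 'YOU')
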
	Substitutions are also made in certain cases.  Some segments
contain pronouns which could stand for a number of different things
of importance to PARRY2.  As we mentioned in the introduction, the
response functions of memory keep track of the context in order to
give pronouns and other anaphoric expressions a correct
interpretation.  For example, the segment:
	( DO YOU AVOID THEM )
could refer to the Mafia, or racetracks, or other patients,
depending on the context.  When such a segment is encountered, the
pronoun is replaced by its current anaphoric value as determined by
the response functions, and a more specific segment such as:
	( DO YOU AVOID MAFIA )
is looked up.
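	The substitution itself is simple once the RESPOND module
supplies the referent; the context table below is a hypothetical
stand-in for the response functions' memory.

    # Replace pronouns with their current anaphoric values before
    # the more specific segment is looked up.
    def substitute_anaphora(segment, context):
        return [context.get(word, word) for word in segment]

    context = {"THEM": "MAFIA"}         # set earlier in the dialogue
    print(substitute_anaphora(["DO", "YOU", "AVOID", "THEM"], context))
    # -> ['DO', 'YOU', 'AVOID', 'MAFIA']
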
	Other utterances, such as "Why did you do that?" or just
"Why?" (which might be regarded as a massive ellipsis), clearly
refer back to previous utterances.  These utterances match very
general patterns which identify the type of question without
indicating the exact topic.  The response function which responds to
"Why?" consults the context to produce an appropriate answer.
	The algorithm next attempts to match the segments with stored
simple patterns, which currently number about 1700.  (The simple
patterns appear in Appendix 3.)  First a complete and perfect match
is sought.  When a match is found, the stored pattern name has a
pointer to the name of a response function in memory which decides
what to do further.  If a match is not found, further
transformations of the segment are carried out and a "fuzzy" match
is tried.
	For fuzzy matching at this stage, we adopted the heuristic
rule of dropping elements in the segment one at a time and
attempting a match each time.  This heuristic allows ignoring
familiar words in unfamiliar contexts.  For example, "well" is
important in "Are you well?" but meaningless in "Well are you?".
	Deleting one element at a time results in, for example, the
pattern:
		( WHAT BE YOU MAIN PROBLEM )
becoming successively:
		(a) ( BE YOU MAIN PROBLEM )
		(b) ( WHAT YOU MAIN PROBLEM )
		(c) ( WHAT BE MAIN PROBLEM )
		(d) ( WHAT BE YOU PROBLEM )
		(e) ( WHAT BE YOU MAIN )
Since the stored pattern in this case matches (d), (e) would not be
constructed.  We found it unwise to delete more than one element,
since our segmentation method usually yields segments containing a
small number (1-4) of words.
	Dropping an element at a time provides a probability
threshold for fuzzy matching which is a function of the length of
the segment.  If a segment consists of five elements, four of the
five must be present in a particular order (with the fifth element
missing in any position) for a match to occur.  If a segment
contains four elements, three must match, and so forth.
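	The deletion heuristic can be sketched as follows (Python for
illustration; the stored patterns shown are a toy subset of the
roughly 1700 of Appendix 3):

    # Fuzzy match: try the segment as-is, then with one element
    # deleted at a time, stopping at the first stored pattern hit.
    def fuzzy_match(segment, stored):
        key = tuple(segment)
        if key in stored:                   # complete, perfect match
            return key
        for i in range(len(segment)):       # one deletion at a time
            variant = tuple(segment[:i] + segment[i + 1:])
            if variant in stored:
                return variant
        return None                         # give up: default condition

    STORED = {("WHAT", "BE", "YOU", "JOB"),
              ("WHAT", "BE", "YOU", "PROBLEM")}
    print(fuzzy_match(["WHAT", "BE", "YOU", "MAIN", "PROBLEM"], STORED))
    # -> ('WHAT', 'BE', 'YOU', 'PROBLEM'), i.e. variant (d) above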

		   COMPOUND-PATTERN MATCH

	When more than one simple pattern is detected in the input, a
second matching is attempted against about 500 compound patterns.
Certain patterns, such as ( HELLO ) and ( I THINK ), are dropped
because they are considered meaningless.  If a complete match is not
found, then simple patterns are dropped, one at a time, from the
compound pattern.  This allows the input,
	(( HOW DO YOU COME ) ( TO BE ) ( IN HOSPITAL ))
to match the stored pattern,
	(( HOW DO YOU COME ) ( IN HOSPITAL )).
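	The compound match works the same way one level up, as the
sketch below illustrates (the meaningless-pattern list and the
stored compound are toy examples):

    # Compound match: discard meaningless simple patterns, then drop
    # one simple pattern at a time until a stored compound fits.
    MEANINGLESS = {("HELLO",), ("I", "THINK")}

    def match_compound(segments, stored):
        parts = tuple(s for s in segments if s not in MEANINGLESS)
        if parts in stored:
            return parts
        for i in range(len(parts)):         # drop one simple pattern
            variant = parts[:i] + parts[i + 1:]
            if variant in stored:
                return variant
        return None                         # default condition

    STORED = {(("HOW", "DO", "YOU", "COME"), ("IN", "HOSPITAL"))}
    print(match_compound([("HOW", "DO", "YOU", "COME"),
                          ("TO", "BE"), ("IN", "HOSPITAL")], STORED))
    # -> (('HOW', 'DO', 'YOU', 'COME'), ('IN', 'HOSPITAL'))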

	If no match can be found at this point, the algorithm has
arrived at a default condition and the appropriate response
functions decide what to do.  For example, in a default condition,
the model may assume control of the interview, asking the
interviewer a question, continuing with the topic under discussion,
or introducing a new topic.
	An annotated example of a diagnostic psychiatric interview is
presented in Appendix 1.


		   ADVANTAGES AND LIMITATIONS

	As mentioned, one of the main advantages of a
pattern-matching strategy is that it can ignore as irrelevant both
some of what it recognizes and what it does not recognize at all.
There are several million words in English, each possessing from one
to over a hundred senses.  To construct a machine-usable word
dictionary of this magnitude is out of the question at this time.
Recognition of natural language input in the manner described above
allows real-time interaction in a dialogue since it avoids becoming
ensnarled in combinatorial disambiguations and long chains of
inferencing which would slow a dialogue algorithm down to
impracticality, if it could even function at all.  The price paid
for pattern-matching is that sometimes, but rarely, ambiguities slip
through.
	Another advantage of this method is its speed.  The algorithm
consists of about 28K of programs written in MLISP, 16K of data in
LISP, and 16K of data in machine language with several overlays.
The complete language recognition process requires less than one
second of real time on a time-shared DEC PDP-10.
	A drawback to PARRY1 is that it reacts to the first pattern
it finds in the input rather than characterizing the input as fully
as possible and then deciding what to do based on a number of tests.
Another practical difficulty with PARRY1, from a programmer's
viewpoint, is that, since it is a procedural model, elements of the
patterns are strung out in various procedures throughout the
algorithm.  It is often a considerable chore for the programmer to
determine whether a given pattern is present and precisely where it
is.  In PARRY2 the patterns are all collected in one part of the
data-base where they can easily be examined.
	Concentrating all the patterns in the data-base gives PARRY2
a limited "learning" ability.  When an input fails to match any
stored pattern, or matches an incorrect one as judged by a human
operator, a pattern which matches the input can be put into the
data-base automatically.  If the new pattern has the same meaning as
a previously stored pattern, the human operator must provide the
name of the appropriate response function.  If he doesn't remember
the name, he may try to rephrase the input in a form recognizable to
PARRY2, and it will name the response function associated with the
rephrasing.  These mechanisms are not "learning" in the commonly
used sense, but they do allow a person to transfer his knowledge
into PARRY2's data-base with very little effort.
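	The acquisition step amounts to adding one entry to the
pattern data-base, as sketched below; the pattern-to-function table
and all names are illustrative.

    # A new pattern is bound to a response function either named by
    # the operator or borrowed from a recognizable rephrasing.
    def add_pattern(patterns, new_pattern, operator_name=None,
                    rephrasing=None):
        if operator_name is not None:
            patterns[new_pattern] = operator_name
        elif rephrasing in patterns:
            patterns[new_pattern] = patterns[rephrasing]
        else:
            raise ValueError("no response function for new pattern")

    patterns = {("WHAT", "BE", "YOU", "JOB"): "ANSWER_JOB_QUESTION"}
    add_pattern(patterns, ("WHAT", "BE", "YOU", "TRADE"),
                rephrasing=("WHAT", "BE", "YOU", "JOB"))
    print(patterns[("WHAT", "BE", "YOU", "TRADE")])
    # -> ANSWER_JOB_QUESTION
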
	Informal observation thus far shows PARRY2's linguistic
recognition abilities to be quite superior to PARRY1's.  A more
systematic and quantitative evaluation of performance is now being
carried out.  PARRY1 was extensively tested by having judges make
ratings of its performance along several dimensions, one of which
was linguistic noncomprehension (Colby and Hilf, 1974).  These
judges also made ratings of teletyped interviews with psychiatric
patients and with a random version of PARRY1.  The mean ratings of
PARRY1 along the dimension of linguistic noncomprehension were
better than those received by RANDOM-PARRY but were three times
worse than the mean ratings received by patients.  Once the ratings
of PARRY2 along this dimension are completed, we will be able to
compare them with those of PARRY1 and the patients and obtain a more
objective measure of improvement.


                  REFERENCES

Colby, K.M., Weber, S., and Hilf, F.D. (1971). Artificial Paranoia.
	ARTIFICIAL INTELLIGENCE, 2, 1-25.

Colby, K.M. and Hilf, F.D. (1974). Multidimensional Evaluation of
	a Computer Simulation of Paranoid Thought. To appear in
	KNOWLEDGE AND COGNITION, L. Gregg (Ed.).

Wilks, Y. (1973). An Artificial Intelligence Approach to Machine
	Translation. In COMPUTER MODELS OF THOUGHT AND
	LANGUAGE, R.C. Schank and K.M. Colby, Eds., W.H. Freeman,
	San Francisco.

Winograd, T.A. (1972). Understanding Natural Language.
	COGNITIVE PSYCHOLOGY, 3, 1-191.

Woods, W.A. (1970). Transition Network Grammars for Natural Language
	Analysis. COMMUNICATIONS OF THE ACM, 13, 591-606.